feat(api): Add Table.sample #7377

Merged
merged 9 commits into from
Oct 17, 2023

Conversation

@jcrist jcrist commented Oct 16, 2023

This adds a new method Table.sample for selecting a random sample of rows from a table.

The method takes the following arguments:

  • fraction: a float between 0 and 1 representing the fraction of rows to select
  • method: either "row" or "block" (defaults to "row") describing how to perform the sample. Blockwise sampling may be more efficient for some backends, at the potential cost of correctness
  • seed: a random seed to use for repeatability
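To make the "row" versus "block" trade-off concrete, here is a minimal pure-Python sketch of the two sampling semantics. It is not the code in this PR and not how any backend implements sampling; `sample_rows`, `sample_blocks`, and `block_size` are hypothetical names used only for illustration.

```python
import random


def sample_rows(rows, fraction, seed=None):
    """Row-wise sampling: each row is kept independently with probability ~fraction."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return [row for row in rows if rng.random() <= fraction]


def sample_blocks(rows, fraction, block_size=4, seed=None):
    """Block-wise sampling: whole blocks of adjacent rows are kept or dropped together.

    `block_size` is a hypothetical stand-in for a backend's storage page or block.
    """
    rng = random.Random(seed)
    sampled = []
    for start in range(0, len(rows), block_size):
        # One random draw per block instead of one per row.
        if rng.random() <= fraction:
            sampled.extend(rows[start:start + block_size])
    return sampled


rows = list(range(100))
print(len(sample_rows(rows, 0.1, seed=42)))
print(len(sample_blocks(rows, 0.1, block_size=10, seed=42)))
```

Block-wise sampling needs far fewer random draws and can skip whole blocks, but rows in the same block are always selected together, so the result is a less uniform sample when neighbouring rows are correlated.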

When possible this is compiled as a TABLESAMPLE clause, falling back to t.filter(ibis.random() <= fraction) if that's not available. However, there are a few backends where we could use a tablesample but currently don't:

  • postgres and impala: these backends support TABLESAMPLE, but only for physical tables (not subqueries, views, ...). For now we support these backends via rewriting to t.filter(random() <= fraction). I think we should be able to support compiling Table.sample to optionally use a TABLESAMPLE when possible for these backends, but that's not implemented here.
  • bigquery: same restriction to physical tables, and it also only supports blockwise sampling. I've left this backend unimplemented for now.
  • snowflake: should be supportable, but I haven't implemented it in this PR since I can't currently test against snowflake.
  • clickhouse and oracle: these backends have a sampling syntax, but it only works on physical tables with certain metadata (indices, sampling setup, ...). I don't think we can support their sampling syntax without losing the ability to sample arbitrary table expressions.
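As a rough illustration of the "use TABLESAMPLE when possible, otherwise filter on random()" idea for a backend like postgres, the sketch below builds the two SQL shapes as plain strings. This is not implemented in this PR and is not ibis's compiler: `compile_sample_sql` and its `is_physical_table` flag are hypothetical, and the SQL follows the PostgreSQL `TABLESAMPLE` syntax, which takes a percentage rather than a fraction.

```python
from typing import Optional


def compile_sample_sql(
    source: str,
    fraction: float,
    *,
    is_physical_table: bool,
    method: str = "row",
    seed: Optional[int] = None,
) -> str:
    """Hypothetical helper: pick TABLESAMPLE for physical tables, else a filter."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    if method not in ("row", "block"):
        raise ValueError("method must be 'row' or 'block'")

    if is_physical_table:
        # PostgreSQL: BERNOULLI samples individual rows, SYSTEM samples pages.
        keyword = "BERNOULLI" if method == "row" else "SYSTEM"
        sql = f"SELECT * FROM {source} TABLESAMPLE {keyword} ({fraction * 100:g})"
        if seed is not None:
            sql += f" REPEATABLE ({seed})"
        return sql

    # Subqueries and views: fall back to filtering on a random number.
    # Assumption: `method` and `seed` are ignored on this path in the sketch.
    return f"SELECT * FROM ({source}) AS t WHERE random() <= {fraction!r}"


print(compile_sample_sql("events", 0.25, is_physical_table=True, seed=7))
print(compile_sample_sql("SELECT * FROM events WHERE x > 0", 0.25, is_physical_table=False))
```

The filter fallback works on any table expression, which is why it is the safe default here; the TABLESAMPLE path is only an optimization for the physical-table case.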

Closes #7139.

@cpcloud cpcloud added this to the 7.1 milestone Oct 17, 2023

@cpcloud cpcloud left a comment

A few comments, nothing blocking.

This is great 🎉, thank you!

ibis/expr/operations/relations.py (outdated, resolved)
ibis/expr/rewrites.py (resolved)
ibis/expr/types/relations.py (resolved)
@jcrist jcrist enabled auto-merge (rebase) October 17, 2023 15:26
@cpcloud cpcloud added the feature Features or general enhancements label Oct 17, 2023
@jcrist jcrist merged commit 51027d9 into ibis-project:master Oct 17, 2023
81 checks passed
@jcrist jcrist deleted the tablesample branch October 17, 2023 16:05
Labels
feature Features or general enhancements
Development

Successfully merging this pull request may close these issues.

feat: sampling from table expressions
2 participants